Abstract

The causal Markov condition (CMC) is a postulate that links observations to causality. It
describes the conditional independences among the observations that are entailed by a causal
hypothesis in terms of a directed acyclic graph. In the conventional setting, the observations are
random variables and the independence is a statistical one, i.e., the information content of
observations is measured in terms of Shannon entropy. We formulate a generalized CMC for any
kind of observations on which independence is defined via an arbitrary submodular information
measure. Recently, this has been discussed for observations in terms of binary strings where
information is understood in the sense of Kolmogorov complexity. Our approach enables us to find
computable alternatives to Kolmogorov complexity, e.g., the length of a text after applying
existing data compression schemes. We show that our CMC is justified if one restricts attention to
a class of causal mechanisms that is adapted to the respective information measure. Our
justification is similar to deriving the statistical CMC from functional models of causality, where
every variable is a deterministic function of its observed causes and an unobserved noise term.

Our experiments on real data demonstrate the performance of compression-based causal inference.



1 Introduction 



Explaining observations in the sense of inferring the underlying causal structure is among the most
important challenges of scientific reasoning. In practical applications it is generally accepted that
causal conclusions can be drawn from observing the influence of interventions. The more challenging
task, however, is to infer causal relations on the basis of non-interventional observations, and research
in this direction is still regarded with skepticism. It is therefore important to thoroughly formalize
the assumptions and discuss the conditions under which they are satisfied. For causal reasoning from
statistical data, Spirtes, Glymour, Scheines [1] and Pearl formalized the assumptions under which
the task is solvable. With respect to a causal hypothesis in terms of a directed acyclic graph (DAG),
the most basic assumption is the causal Markov condition, stating that every variable is conditionally
independent of its non-descendants, given its parents,

$x_j \perp\!\!\!\perp nd_j \mid pa_j$ ,

for short. Pearl argues that this follows from a "functional model" of causality (or non-linear structural
equations), where every node is a deterministic function of its parents $pa_j$ and an unobserved noise
term $n_j$ (see Fig. 1), i.e.,

$x_j = f_j(pa_j, n_j)$ . (1)

The causal Markov condition is then a consequence of the statistical independence of the noise terms,
which is called causal sufficiency. It can be justified by the assumption that every dependence between
them requires a common cause (as postulated by Reichenbach [3]), which should then explicitly
appear in the causal model. From a more abstract point of view, condition (1) can be interpreted as saying
that the node $x_j$ does not add any information that is not already contained in the parents and
the noise together. If we restrict the assumption to discrete variables, the corresponding information
measure can be, for instance, the Shannon entropy, but other measures could also make sense.

In [4] the probabilistic setting is generalized to the case where every observation is formalized by
a binary string $x_j$ (without any statistical population). The information content of an observation is
then measured using Kolmogorov complexity (also called "algorithmic information"), which gives rise to an
algorithmic version of (conditional) mutual information. The corresponding functional model is given
by a Turing machine that computes the string $x_j$ from its parent strings $pa_j$ and a noise $n_j$.
The approach based on algorithmic information theory generalizes the statistical framework, since the
average algorithmic information content per instance of a sequence of i.i.d. observations converges to
the Shannon entropy, while on the other hand observations need not be generated by i.i.d. sampling.

Unfortunately, Kolmogorov complexity is uncomputable and practical causal inference schemes
must deal with other measures of information. In Section 2 we define general information measures
and show that they induce independence relations that satisfy the semi-graphoid axioms (Section 3).
Then, in Section 4 we phrase the causal Markov condition within our general setting and explore
under which conditions it is a reasonable postulate. To this end, we formulate an information-theoretic
version of functional models, observing that their decisive feature is that the joint information of a
node, its parents and its noise is the same as the joint information of its parents and noise alone. We
demonstrate with examples how these functional models restrict the set of allowed causal mechanisms
to a certain class (Section 5). We emphasize that the choice of the information measure determines
this class and is therefore the essential prior decision (which certainly requires domain knowledge).
Thus, when applying our theory to real data, one first has to think about the causal mechanisms to be
explored and then design an information measure that is sufficiently "powerful" to detect the
generated dependences.

Section 6 discusses a modification of known independence-based causal inference that is necessary
for those information measures for which conditioning can only decrease dependences. Section 7
describes one of the most important intended applications of our theory, namely information
measures based on compression schemes (e.g. Lempel-Ziv). Applying these measures, via the
PC algorithm for causal inference, to segments of English text demonstrates the strength of causal
reasoning that goes beyond already known applications of compression for the purpose of (hierarchical)
clustering.



2 General information measures 

In this section we define information from an axiomatic point of view and prove properties that will
be useful in the derivation of the causal Markov condition. We start by rephrasing the usual concept of
measuring statistical dependences. Let $X$ be a set of discrete-valued random variables and $\Omega := 2^X$
be the set of its subsets. For each $A \in \Omega$ let $H(A)$ denote the joint Shannon entropy of the variables in
$A$. For three disjoint sets $A, B, C$ the conditional mutual information between $A$ and $B$ given $C$ then
reads

$I(A : B \mid C) := H(A \cup C) + H(B \cup C) - H(A \cup B \cup C) - H(C)$ . (2)

The set of subsets constitutes a lattice $(\Omega, \vee, \wedge)$ with respect to the operations of union and intersection
and $H$ can be seen as a function on this lattice.^1 We observe that the non-negativity of (2) can be
guaranteed if

$H(D) + H(E) \geq H(D \vee E) + H(D \wedge E)$ ,

for two sets $D, E \in \Omega$. This submodularity condition is known to be true for Shannon entropy
[5]. Motivated by these remarks, we now introduce an abstract information measure defined on the
elements of a general lattice. Throughout this paper let $(\Omega, \wedge, \vee)$ be a finite lattice and denote by
$0$ the meet of all of its elements.

Definition 1 (information measure)

We say $R : \Omega \to \mathbb{R}_+$ is an information measure if it satisfies the following axioms:

(1) normalization: $R(0) = 0$,

(2) monotonicity: $s \leq t$ implies $R(s) \leq R(t)$ for all $s, t \in \Omega$,

(3) submodularity: $R(s) + R(t) \geq R(s \vee t) + R(s \wedge t)$ for all $s, t \in \Omega$.

Note that submodular functions have been considered in different contexts, for example in [6] and [7].
Based on $R$ we define a conditional version for all $s, t \in \Omega$ by

$R(s \mid t) := R(s \vee t) - R(t)$.

For ease of notation we write $R(s, t)$ instead of $R(s \vee t)$. In analogy to (2), $R$ gives rise to the
following measure of independence.

Definition 2 (conditional mutual information) For $s, t, u \in \Omega$, the conditional mutual information
of $s$ and $t$ given $u$ is defined by

$I(s : t \mid u) := R(s, u) + R(t, u) - R(s, t, u) - R(u)$.

We say $s$ and $t$ are independent given $u$, or equivalently $s \perp\!\!\!\perp t \mid u$, if $I(s : t \mid u) = 0$.
The following lemmas generalize results from standard information theory.

Lemma 1 (non-negativity of mutual information and conditioning) For $s, t, u \in \Omega$ we have
(a) $I(s : t \mid u) \geq 0$ and (b) $0 \leq R(s \mid t, u) \leq R(s \mid t)$.

Proof: (a) By definition, $I(s : t \mid u) \geq 0$ is equivalent to $R(s, u) + R(t, u) \geq R(s, t, u) + R(u)$.
Defining $a = s \vee u$ and $b = t \vee u$ and using associativity of $\vee$ we have $a \vee b = s \vee t \vee u$. Further,
using Lemma 4 in Ch. 1 of [8], in any lattice

$a \wedge b = (s \vee u) \wedge (t \vee u) \geq u \vee (s \wedge t) \geq u$

and hence by monotonicity of $R$: $R(a \wedge b) \geq R(u)$. Combining everything,

$R(s, u) + R(t, u) = R(a) + R(b) \geq R(a \vee b) + R(a \wedge b) \geq R(s, t, u) + R(u)$,

where the first inequality uses submodularity of $R$.

(b) The first inequality follows from (a) because $I(s : s \mid t, u) = R(s \mid t, u)$. The second inequality follows
from (a) and the definition of $I$, since $R(s \mid t) - R(s \mid t, u) = I(s : u \mid t) \geq 0$. □

^1 Although the information functions that are presented in this paper can all be rephrased as functions on the
lattice of subsets, it is nevertheless notationally convenient to formulate the theory with respect to general lattices.

Lemma 2 (chain rule for mutual information) For $s, t, u, x \in \Omega$,

$I(s : t \vee u \mid x) = I(s : t \mid x) + I(s : u \mid t, x)$ . (3)

Proof: This is directly seen by using the definition of conditional mutual information on both sides. □

Lemma 3 (data processing inequality) Given $s, t, x \in \Omega$ it holds that

$R(s \mid t) = 0 \;\Rightarrow\; I(s : x \mid t) = 0 \;\Rightarrow\; I(s : x) \leq I(t : x)$.

Proof: The first implication is clear, since $I(s : x \mid t) = R(s \mid t) - R(s \mid t, x)$ and, by Lemma 1 (b),
$0 \leq R(s \mid t, x) \leq R(s \mid t) = 0$. For the second we apply the chain rule for mutual information
two times and obtain

$I(s : x) = I(s, t : x) - I(t : x \mid s) = I(t : x) + I(s : x \mid t) - I(t : x \mid s) \leq I(t : x)$ ,

since the second summand is zero by assumption and conditional mutual information is non-negative.
□
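
The axioms of Definition 1 and the dependence measure of Definition 2 can be checked by brute force on a
small lattice of subsets. The following Python sketch is only an illustration: the helper names `subsets`,
`is_information_measure` and `cond_mutual_info`, as well as the two toy measures `capped` and `squared`, are
assumptions made for this example.

```python
from itertools import combinations

def subsets(ground):
    """All subsets of `ground` as frozensets (the lattice 2^ground)."""
    items = list(ground)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_information_measure(R, ground, tol=1e-12):
    """Brute-force check of the three axioms of Definition 1 on 2^ground."""
    lattice = subsets(ground)
    if abs(R(frozenset())) > tol:                        # (1) normalization
        return False
    for s in lattice:
        for t in lattice:
            if s <= t and R(s) > R(t) + tol:             # (2) monotonicity
                return False
            if R(s) + R(t) + tol < R(s | t) + R(s & t):  # (3) submodularity
                return False
    return True

def cond_mutual_info(R, s, t, u=frozenset()):
    """I(s : t | u) = R(s,u) + R(t,u) - R(s,t,u) - R(u), as in Definition 2."""
    return R(s | u) + R(t | u) - R(s | t | u) - R(u)

def capped(s):
    """Toy measure (assumption): concave in |s|, hence submodular."""
    return min(len(s), 2)

def squared(s):
    """Toy counterexample (assumption): convex in |s|, not submodular."""
    return len(s) ** 2

ground = {"a", "b", "c"}
a, b, c = frozenset("a"), frozenset("b"), frozenset("c")
print(is_information_measure(capped, ground))
print(is_information_measure(squared, ground))
print(cond_mutual_info(capped, a, b), cond_mutual_info(capped, a, b, c))
```

For the toy measure `capped`, $a$ and $b$ are independent but become dependent given $c$, which shows that
conditioning can create dependence for a general submodular measure.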



3 Submodular dependence measures and semi-graphoid axioms 

The axiomatic approach to stochastic independence goes back to Dawid [9], who stated four axioms of
conditional independence that are fulfilled for any kind of probability distribution. Later, any relation 
$I$ on triplets that satisfies the same axioms has been named semi-graphoid in [1]. In the following we
show that the function $I$ constructed from $R$ in the last section satisfies these axioms.

Lemma 4 ($I$ satisfies semi-graphoid axioms) The function $I$ defined in the last section satisfies the
semi-graphoid axioms, namely for $x, y, w, z \in \Omega$,

(1) $I(x : y \mid z) = 0 \;\Rightarrow\; I(y : x \mid z) = 0$ (symmetry)

(2) $I(x : y, w \mid z) = 0 \;\Rightarrow\; I(x : y \mid z) = 0$ and $I(x : w \mid z) = 0$ (decomposition)

(3) $I(x : y, w \mid z) = 0 \;\Rightarrow\; I(x : y \mid z, w) = 0$ (weak union)

(4) $I(x : w \mid z, y) = 0$ and $I(x : y \mid z) = 0 \;\Rightarrow\; I(x : w, y \mid z) = 0$ (contraction)

Proof: Symmetry is clear and the remaining implications follow directly from the chain rule and
non-negativity. □

Conversely, if we are given a function $I : \Omega \times \Omega \times \Omega \to \mathbb{R}_+$, what axioms do we need
to define a submodular information measure $R$ from $I$? It turns out that the chain rule in eq. (3)
together with non-negativity $I(a : b \mid c) \geq 0$ and symmetry $I(a : b \mid c) = I(b : a \mid c)$ already implies
that $R(a) := I(a : a \mid 0)$ is an information measure and $I$ coincides with the dependence measure
introduced in Definition 2. We omit the proof due to space constraints.

Thus we have characterized the type of dependence measures that we are able to incorporate into our
framework. To show that the chain rule is actually a strong restriction we close the section with an
example of independence of orthogonal linear subspaces.

Example: (orthogonal subspaces, qualitative version e.g. in [10])

Linear subspaces of some finite-dimensional vector space form a lattice, where the join of two subspaces is the
subspace generated by their set-theoretic union. For two such subspaces $a$ and $b$ write $\pi_b(a)$ for the
orthogonal projection of $a$ onto $b$. We define $I(a : b) = \dim \pi_b(a)$, hence orthogonal subspaces are
considered as independent. Given a third subspace $c$ the conditional version reads

$I(a : b \mid c) = \dim \pi_{b|_{c^\perp}}(a|_{c^\perp})$,

where $a|_{c^\perp}$ and $b|_{c^\perp}$ denote the orthogonal projections of $a$ and $b$ to the orthogonal complement $c^\perp$ of
$c$. $I$ is symmetric and non-negative, but it does not satisfy the chain rule. This is because projecting
a subspace first to $c^\perp$ and then to $b^\perp$ is different from projecting it to $(b \vee c)^\perp$. Hence we cannot
find a submodular function $R$ underlying our dependence measure. Nevertheless $I$ satisfies the
semi-graphoid axioms and can thus be considered as a measure of independence.
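
The failure of the chain rule can be made concrete with three lines in $\mathbb{R}^2$. The following worked
computation is an illustration only; the specific vectors are chosen here for convenience and are an assumption
of this example. It compares both sides of the chain rule (3) with trivial conditioning, i.e.
$I(a : b \vee c)$ versus $I(a : c) + I(a : b \mid c)$.

```latex
\begin{align*}
a &= \mathrm{span}\{(1,0)\}, \quad b = \mathrm{span}\{(0,1)\}, \quad
c = \mathrm{span}\{(1,1)\}, \quad c^\perp = \mathrm{span}\{(1,-1)\}, \\
I(a : b \vee c) &= \dim \pi_{\mathbb{R}^2}(a) = 1, \\
I(a : c) &= \dim \pi_c(a) = 1 \quad \text{since } (1,0) \cdot (1,1) \neq 0, \\
a|_{c^\perp} &= b|_{c^\perp} = c^\perp \;\Rightarrow\;
I(a : b \mid c) = \dim \pi_{c^\perp}(c^\perp) = 1, \\
I(a : c) + I(a : b \mid c) &= 2 \;\neq\; 1 = I(a : b \vee c).
\end{align*}
```

Note that in this example $a$ and $b$ are orthogonal, so $I(a : b) = 0$, while $I(a : b \mid c) = 1$.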

4 Causal Markov condition for general information measures 

In this section we define three versions of the causal Markov condition with respect to a general
submodular information measure and show that they are equivalent (similar to the statistical framework).
Then we discuss under which conditions we expect it to be a reasonable postulate that links
observations with causality. Assume we are given observations $x_1, \dots, x_k$ that are connected by a DAG. It is
no restriction to consider the observations as elements of a lattice, e.g. the lattice of their subsets.

Definition 3 (causal Markov condition (CMC), local version) Let $G$ be a DAG that describes the
causal relations among observations $x_1, \dots, x_k$. Then the observations are said to fulfill the causal
Markov condition with respect to the dependence measure $I$ if

$I(nd_j : x_j \mid pa_j) = 0$ for all $1 \leq j \leq k$,

where $pa_j$ denotes the join of the parents of $x_j$ and $nd_j$ the join of its non-descendants (excluding the
parents).

The intuitive meaning of the postulate is that conditioning on the direct causes of an observation 
screens off its dependences from all its non-effects. The following theorem generalizes results in [10]
for statistical independences and [4] for algorithmic independences. In particular it states that if the
causal Markov condition holds with respect to a graph G, then independence relations implied by the 
CMC can be obtained through the convenient graph-theoretical criterion of d-separation (121, 01). 

Theorem 1 (Equivalence of Markov conditions and information decomposition)

Let the nodes $x_1, \dots, x_k$ of a DAG $G$ be elements of some lattice $\Omega$ and $R$ be an information measure
on $\Omega$. Then the following three properties are equivalent:

(1) $x_1, \dots, x_k$ fulfill the (local) causal Markov condition.

(2) For every ancestral set^2 $A \subseteq \{x_1, \dots, x_k\}$, $R$ decomposes according to $G$:

$R(A) = \sum_{x_i \in A} R(x_i \mid pa_i)$.

(3) The global Markov condition holds, i.e., if for three sets of nodes $A, B, C$ the sets $A$ and $B$ are
d-separated by $C$ in $G$, then

$\left( \bigvee_{a \in A} a \right) \perp\!\!\!\perp \left( \bigvee_{b \in B} b \right) \,\Big|\, \left( \bigvee_{c \in C} c \right)$.

The proof is provided in the appendix. The second condition shows that the joint information of
observations can be recursively computed according to the causal structure. The third condition
describes explicitly which sets of independences are implications of the causal Markov condition.

Our next theorem will show that the CMC follows from a general notion of a functional model.
It is based on the following lemma, which states that the CMC on a given set of observations can be
derived from the causal Markov condition with respect to an extended causal graph (see Figure 1).

Lemma 5 (causal Markov condition from extended graph) Let the nodes $x_1, \dots, x_k$ of a DAG $G$
be elements of a lattice $\Omega$ with an independence relation $I$ that is monotone and satisfies the chain
rule. If there exist additional elements $n_1, \dots, n_k \in \Omega$ such that for all $j$

$I(x_j : nd_j, n_{-j} \mid pa_j, n_j) = 0$, where $n_{-j} := \bigvee_{i \neq j} n_i$, (4)

and the $n_j$ are jointly independent in the sense that

$I(n_j : n_{-j}) = 0$ , (5)

then the $x_1, \dots, x_k$ fulfill the causal Markov condition with respect to $G$.

Proof: Based on $G$ we construct a new graph $G'$ with node set $\{n_1, \dots, n_k\} \cup \{x_1, \dots, x_k\}$ and an
additional edge $n_j \to x_j$ for every $j$ ($1 \leq j \leq k$). We first show that the causal Markov condition
holds for the nodes of $G'$: By construction, the join of non-descendants $nd'_j$ of $x_j$ with respect to $G'$
is equal to $n_{-j} \vee nd_j$. Since the join of the parents $pa'_j$ of $x_j$ in $G'$ is $pa_j \vee n_j$, assumption (4) just
states $I(x_j : nd'_j \mid pa'_j) = 0$, which is the local CMC with respect to $x_j$. To see that the CMC also holds
for $n_j$, observe that the non-descendants of $n_j$ are equal to the non-descendants of $x_j$ in $G'$ and, since
$n_j$ does not have any parents, we have to show

$I(n_j : nd'_j) = 0$. (6)

Using $nd'_j = n_{-j} \vee nd_j$ together with the chain rule for mutual information we get

$I(n_j : nd_j, n_{-j}) = I(n_j : n_{-j}) + I(n_j : nd_j \mid n_{-j}) = I(n_j : nd_j \mid n_{-j})$,

where the last equality follows from (5). Let $ND_j = \{x_{j_1}, \dots, x_{j_{k_j}}\}$ be the set of non-descendants of
$x_j$ in $G$. Note that $ND_j$ is ancestral, that is, if $x \in ND_j$, then so are the ancestors of $x$. We introduce
a topological order on $ND_j$, such that if there is an edge $x_{j_a} \to x_{j_b}$ in $G$, then $x_{j_a} < x_{j_b}$. Using the
chain rule for mutual information iteratively we get

$I(n_j : nd_j \mid n_{-j}) = \sum_{a=1}^{k_j} I(n_j : x_{j_a} \mid x^{<}_{j_a}, n_{-j})$,

where $x^{<}_{j_a}$ denotes the join of elements of $ND_j$ smaller than $x_{j_a}$. By choice of our ordering the
mutual information of $n_j$ and $x_{j_a}$ is conditioned at least on its parents and we can write
$x^{<}_{j_a} = pa^{c}_{j_a} \vee pa_{j_a}$, where $pa^{c}_{j_a}$ is the join of elements smaller than $x_{j_a}$ in $ND_j$ that are not its parents.
Therefore, again by the chain rule, each summand on the right hand side can be bounded from above
by writing

$I(n_j : x_{j_a} \mid x^{<}_{j_a}, n_{-j}) \leq I(n_{-j_a}, pa^{c}_{j_a} : x_{j_a} \mid pa_{j_a}, n_{j_a})$

$\leq I(n_{-j_a}, nd_{j_a} : x_{j_a} \mid pa_{j_a}, n_{j_a}) = 0$,

where the second inequality is true because by construction $pa^{c}_{j_a}$ is a join of non-descendants of $x_{j_a}$.
The right hand side vanishes because of assumption (4). This proves (6) and therefore the causal
Markov condition with respect to $G'$.

By Theorem 1, d-separation on $G'$ implies independence. Due to the special structure of $G'$ one can
check that d-separation in $G$ implies d-separation in the extended graph $G'$. Again by Theorem 1,
d-separation implies the causal Markov condition for $G$, which proves the lemma. □

^2 A set $A$ of nodes of a DAG $G$ is called ancestral, if for every $v \in A$ the parents of $v$ are in $A$ too.

Figure 1: On the left a causal model of four observations $x_1, \dots, x_4$ is shown together with the 'noise' for
each node. In Lemma 5 it is shown that the causal Markov condition on this extended graph implies the CMC
for $x_1, \dots, x_4$. On the right hand side the functional model assumption is illustrated: The generation of $x_i$ from
its parents $pa_i$ and the 'noise' does not produce additional information.

Now we formalize the intuition that in a generalized functional model a node only contains
information that is already contained in the direct causes and the noise together (see Figure 1):

Definition 4 (functional model) Let $G$ be a DAG with nodes $x_1, \dots, x_k$ in the lattice $\Omega$. If there
exists an additional node $n_j \in \Omega$ for each $x_j$, such that the $n_j$ are jointly independent and

$R(x_j, pa_j, n_j) = R(pa_j, n_j)$ for all $j$ ($1 \leq j \leq k$), (7)

then $G$ together with $n_1, \dots, n_k$ is called a functional model of the $x_1, \dots, x_k$.

If we restrict our attention to causal mechanisms of the above form, the CMC is justified:

Theorem 2 (functional model implies CMC) If there exists a functional model for the nodes $x_1, \dots, x_k$
of a DAG $G$, then they fulfill the causal Markov condition with respect to $G$.

Proof: In the functional model with noise nodes $n_j$ it holds that $R(x_j, pa_j, n_j) = R(pa_j, n_j)$ for all $j$,
i.e., $R(x_j \mid pa_j, n_j) = 0$. This implies $I(nd_j : x_j \mid pa_j \vee n_j) = 0$ and, more generally by the first
implication of Lemma 3, $I(x_j : y \mid pa_j \vee n_j) = 0$ for every $y \in \Omega$, so that assumption (4) holds. Since the
$n_j$ in a functional model are assumed to be jointly independent, Lemma 5 can be applied and proves the theorem. □

The following section describes examples of causal mechanisms that can be seen as functional
models with respect to various information measures.

5 Examples of information measures and their functional models 

Let $x_1, \dots, x_k$ be a finite set of observations which are in a canonical way elements of the lattice of
subsets $(2^{\{x_1, \dots, x_k\}}, \cup, \cap)$. Let the causal structure be a DAG with $x_1, \dots, x_k$ as nodes.

5.1 Shannon entropy of random variables

Let the $x_i$ be discrete random variables with joint probability mass function $p(x_1, \dots, x_k)$. For a
subset $A \subseteq \{x_1, \dots, x_k\}$ denote by $x_A := \bigvee_{x_i \in A} x_i$ the random variable with distribution
$p_A := p((x_i)_{x_i \in A})$. The Shannon entropy for the subset $A$ is defined as $H(A) := -\mathbb{E}_p \log p_A$. Monotonicity
as well as submodularity are well-known properties [5]. The corresponding notion of independence
is the familiar (conditional) stochastic independence, its information-theoretic quantification $I$ being
mutual information. Then $H(x_i, pa_i, n_i) = H(pa_i, n_i)$ is equivalent to the existence of some function
$f_i$ with

$x_i = f_i(pa_i, n_i)$ .

This restricts the set of mechanisms to those which would be deterministic if one could take all latent
factors into account. Note that continuous Shannon entropy is not monotone under restriction to
subsets. Nevertheless, in this case the chain rule and non-negativity still hold and therefore the CMC can
be motivated by independences with respect to an extended causal model (Lemma 5 of the previous
section).
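
The equivalence between $H(x_i, pa_i, n_i) = H(pa_i, n_i)$ and a deterministic mechanism can be checked
numerically for a small discrete model. The sketch below is an illustration under assumptions of our own: a
single binary parent, a binary noise term with $P(n = 1) = 0.2$, and the mechanism $x = pa \oplus n$ (XOR). The
helper names `entropy` and `joint_from_mechanism` are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import log2

# Coordinates of an outcome tuple: parent, noise, child.
PA, N, X = 0, 1, 2

def entropy(joint, idx):
    """Shannon entropy (bits) of the marginal of `joint` on the coordinates `idx`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in idx)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def joint_from_mechanism(f, p_pa=0.5, p_n=0.2):
    """Joint pmf of (pa, n, x) with independent binary pa, n and x = f(pa, n)."""
    joint = defaultdict(float)
    for pa, n in product((0, 1), repeat=2):
        p = (p_pa if pa else 1 - p_pa) * (p_n if n else 1 - p_n)
        joint[(pa, n, f(pa, n))] += p
    return dict(joint)

joint = joint_from_mechanism(lambda pa, n: pa ^ n)      # x = pa XOR n

# Functional model condition: x adds no information to (pa, n).
gap = entropy(joint, (X, PA, N)) - entropy(joint, (PA, N))
# Given the parent alone, x still carries the entropy of the noise.
h_x_given_pa = entropy(joint, (X, PA)) - entropy(joint, (PA,))
print(f"H(x, pa, n) - H(pa, n) = {gap:.3e}")
print(f"H(x | pa) = {h_x_given_pa:.3f} bits")
```

The condition is violated as soon as the mechanism uses randomness that is not contained in the noise node,
because $x$ is then no longer a function of $(pa, n)$.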

5.2 Kolmogorov complexity of binary strings 

Let the $x_j$ be binary strings and let the information measure be the Kolmogorov complexity.
More explicitly, for a subset of strings $A \subseteq S$ denote by $x_A$ a concatenation of the
strings in a prefix-free manner (which guarantees that the concatenation can be uniquely decoded into
its components). The Kolmogorov complexity $K(x_A)$ is then defined as the length of the shortest
program that generates the concatenated string $x_A$ on a universal prefix-free Turing machine. It is
submodular up to a logarithmic term [11]. For two strings $s, t$ the conditional Kolmogorov
complexity $K(s \mid t)$ of $s$, given $t$, is defined as the length of the shortest program that computes $s$ from the
input $t$. It must be distinguished from $K(s \mid t^*)$, the length of the shortest program that computes $s$
from the shortest compression $t^*$ of $t$. Note that defining $R(s) := K(s)$ implies that the conditional
information reads $R(s \mid t) = K(s \mid t^*)$ due to

$K(s, t) = K(t) + K(s \mid t^*)$,

which holds up to an additive constant, see [11]. Then

$K(x_i, pa_i, n_i) = K(pa_i, n_i)$ is equivalent to $K(x_i \mid (pa_i, n_i)^*) = 0$,

which, in turn, is equivalent to the existence of a program of length $O(1)$ that computes $x_i$ from
the shortest compression of $(pa_i, n_i)$. Here we have considered the number $k$ of nodes as a
constant, which ensures that the order of the strings does not matter. Such an "algorithmic model of
causality" [4] restricts causal influences to computable ones. Uncomputable mechanisms can easily
be defined (halting problem). However, in the spirit of the Church-Turing thesis, we will assume that
they do not exist in nature and conjecture that the algorithmic model of causality is the most general
model of a causal mechanism as long as we restrict attention to the non-quantum world (where the
model would probably be replaced with a quantum Turing machine).

5.3 Period length of time series 

We start with the following abstract example to illustrate that the definition of an information measure
is more natural when the observations are taken to be part of a lattice different from the lattice of
subsets. Let every observation be a natural number $x_j \in \mathbb{N}$ and consider them as elements of the lattice
of natural numbers where $\vee$ denotes the least common multiple and $\wedge$ the greatest common divisor,
hence for $S \subseteq \{x_1, \dots, x_k\}$

$x_S := \bigvee_{x_i \in S} x_i := \mathrm{lcm}((x_i)_{x_i \in S})$ .

We define an information measure by

$R(x_S) := \log x_S$ .

Non-negativity and monotonicity of $R$ are clear and submodularity even holds with equality: For
$a, b \in \mathbb{N}$

$R(a \vee b) + R(a \wedge b) = \log \mathrm{lcm}(a, b) + \log \gcd(a, b) = \log \frac{a \cdot b}{\gcd(a, b)} + \log \gcd(a, b) = R(a) + R(b)$.

The corresponding conditional dependence measure reads

$I(a : b \mid c) = R(\gcd(a, b) / \gcd(a, b, c)) = \log \gcd(a, b) - \log \gcd(a, b, c)$,

so $a$ and $b$ are independent given $c$ if $c$ contains all prime factors that are shared by $a$ and $b$ (with at
least the same multiplicity).
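
The closed form of the dependence measure can be compared with Definition 2 on concrete numbers. The following
Python sketch is an illustration; the function names `R` and `cond_mutual_info` and the numbers 12 and 18 are
assumptions chosen for this example.

```python
from math import gcd, lcm, log, isclose   # math.lcm and multi-argument gcd need Python 3.9+

def R(*xs):
    """Information of the join of xs: log of their least common multiple."""
    return log(lcm(*xs))

def cond_mutual_info(a, b, c=1):
    """I(a : b | c) from Definition 2; c = 1 is the bottom element of the lattice."""
    return R(a, c) + R(b, c) - R(a, b, c) - R(c)

a, b = 12, 18    # gcd(a, b) = 6 = 2 * 3

# Submodularity holds with equality: R(a v b) + R(a ^ b) = R(a) + R(b).
print(isclose(R(lcm(a, b)) + R(gcd(a, b)), R(a) + R(b)))

# Definition 2 agrees with log gcd(a, b) - log gcd(a, b, c), here for c = 2.
print(isclose(cond_mutual_info(a, b, 2), log(gcd(a, b)) - log(gcd(a, b, 2))))

# c = 6 contains all prime factors shared by a and b, so a and b are independent given c.
print(abs(cond_mutual_info(a, b, 6)) < 1e-12)
```

Conditioning on $c = 2$ removes only the shared factor 2, so the remaining dependence is $\log 3$.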

We define a functional model where every node $x_i$ contains only prime factors that are already
contained in its parents and its noise node (with at least the same multiplicity), i.e., $x_i$ divides
$\mathrm{lcm}(pa_i, n_i)$, and the noise terms are assumed to be relatively prime.

Such a lattice of observations can occur in real life if $x_j$ denotes the period length of a periodic
time series over $\mathbb{Z}$. Then the period length of the joint time series defined by a set of nodes is obviously
the least common multiple. If every time series at node $i$ is a function $F_i$ of its parents and noise node
(each being a time series) and $F_i$ is time-covariant, $x_i$ divides the joint period length of its parents and noise node.

Assuming that the period lengths of the noise time series are relatively prime is indeed a strong
restriction, but if we assume that the periods are large numbers and interpret independence in the
approximate sense

$\log \mathrm{lcm}(\{x_i\}) \approx \sum_i \log x_i$ ,

we obtain the condition that their periods have no large factors in common. This seems to be a
reasonable assumption if the noise time series have no common cause.

One can easily think of generalizations where every observation $x_j$ is characterized by a symmetry
group and the join of nodes by the group intersection describing the joint symmetry. One may then
define functional models where every node inherits all those symmetries that are shared by all its
parents and the noise node.

5.4 Size of vocabulary in a text 

Let every observation $x_i$ be a text and for every collection of texts $S \subseteq \{x_1, \dots, x_k\}$ let $R(S)$ be
the number of different meaningful words in $S$. Here, meaningful means that we ignore words like
articles and prepositions. To see that $R$ is submodular we observe that it is just the number of elements
of a set.

We can use $R$ to explore which author has copied parts of the texts written by other authors: Let
every $x_i$ be written by a different author, and let a causal arrow from $x_i$ to $x_j$ mean that the author of $x_j$
was influenced by $x_i$ when writing $x_j$.

The noise can be interpreted as the set of words the author usually uses and the condition
$R(x_i, pa_i, n_i) = R(pa_i, n_i)$ then means that he/she combines only words from the texts he/she has
seen with his/her own vocabulary.

To conclude this section we want to emphasize that the above example refers to a dependence
measure that is non-increasing under conditioning, that is, for collections $S, T, U$ and $V$ of texts
$I(S : T \mid U) \geq I(S : T \mid V)$ whenever $U \subseteq V$. This is because $I(S : T \mid U)$ is equal to the number of
meaningful words contained in both $S$ and $T$, but not in $U$. In general, the above information measure can
be viewed as rank or height function (cardinality) on the lattice of sets of meaningful words and it can
be shown that dependence measures originating from information functions that are rank functions on
distributive lattices are always non-increasing under conditioning.^3 We will elaborate on this point in
the next section because it poses special challenges for causal inference.
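
The vocabulary measure is simple enough to be written down directly. The following Python sketch is an
illustration under assumptions of our own: the stop-word list `STOP_WORDS`, the whitespace tokenization in
`vocabulary`, the four toy texts, and the causal structure $a \to b \leftarrow c$ are all hypothetical.

```python
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and"}   # assumed, incomplete

def vocabulary(text):
    """Set of 'meaningful' words of a text: lower-cased tokens minus stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def R(*vocabs):
    """Number of distinct meaningful words in a collection of vocabularies."""
    return len(set().union(*vocabs))

def cond_mutual_info(s, t, u=frozenset()):
    """I(s : t | u) = R(s,u) + R(t,u) - R(s,t,u) - R(u)."""
    return R(s, u) + R(t, u) - R(s, t, u) - R(u)

# Hypothetical structure a -> b <- c: b reuses words of a and c plus its own noise n_b.
a = vocabulary("the causal graph of observations")
c = vocabulary("entropy of a binary string")
n_b = vocabulary("markov condition")
b = vocabulary("causal entropy and the markov condition")

# Functional model condition (7) for node b: b adds no new words to (a, c, n_b).
print(R(b, a, c, n_b) == R(a, c, n_b))

# I(s : t | u) counts the words shared by s and t that do not occur in u.
print(cond_mutual_info(a, b), cond_mutual_info(a, b, a), cond_mutual_info(a, c))

# Decomposition of Theorem 1 (2) along a -> b <- c.
print(R(a, b, c) == R(a) + R(c) + (R(b, a, c) - R(a, c)))
```

Because $I(s : t \mid u)$ only counts shared words that are missing from $u$, enlarging the conditioning set can
never increase it.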

6 Faithfulness for monotone dependence measures 

Apart from the CMC, the essential postulate of independence-based causal inference is usually causal
faithfulness. It states that all observed independence relations are structural, that is, they are induced
by the true causal DAG through d-separation. This postulate allows the identification of causal DAGs
up to "Markov equivalence classes" imposing the same independences.

Faithfulness has already been defined for abstract conditional independence statements and we
start by rephrasing the definition following ([1], p. 81).

Definition 5 (faithfulness) A DAG $G$ is said to represent a list of conditional independence relations
$C$ on a set of observations $X$ faithfully, if $C$ consists exactly of the independence relations implied
by $G$ through d-separation. Further, a set of observations $X$ is said to be faithful (w.r.t. a given
dependence measure), if there exists a causal model that represents $X$ faithfully.

The above definition of faithfulness makes sense for the probabilistic and algorithmic notions of
dependence, but there is a problem with respect to dependence measures on which conditioning can only
decrease information. As mentioned above, rank functions of distributive lattices lead to this kind of
dependence measures, which we will call monotone in the following. To see the problem, consider for
three observations $a, b, c$ a causal model $G$ of the form $a \to b \leftarrow c$. By d-separation, $a$ is independent
of $c$ and for a monotone dependence measure this implies $a \perp\!\!\!\perp c \mid b$, which is not an independence
induced by d-separation. Hence, $G$ does not faithfully represent the objects and one can easily check
that a faithful representation does not exist (e.g. using the theorem below). However, we can
modify faithfulness such that it also accounts for those independences that follow from monotonicity under
conditioning:

^3 Lattices with a rank function that is submodular are generally called semimodular lattices. Abstract independence
measures on semimodular lattices have also been discussed in a different context by Cuzzolin [13].

Definition 6 (monotone faithfulness) A DAG $G$ is said to represent a list $C$ of conditional
independences of observations $X$ monotonely faithfully, if the following condition is true for all disjoint subsets
$S, T, U \subseteq X$ whose joins are denoted by $s, t$ and $u$: Whenever $s \perp\!\!\!\perp t \mid u$ is in $C$ and $u$ is minimal among
all the sets that render $s$ and $t$ independent, then $s$ and $t$ are d-separated by $u$ in $G$. Further, a set of
observations $X$ is said to be monotonely faithful (w.r.t. a given dependence measure), if there exists a
causal model that represents $X$ monotonely faithfully.

Note that, trivially, every faithful representation is a monotonely faithful representation, hence faithful
observations are monotonely faithful observations. Faithful representations have already been
characterized (Theorem 3.4 in [1]) and we prove an equivalent characterization that holds simultaneously
for monotonely faithful and for faithful observations.

Theorem 3 (characterization of monotonely faithful representations) A set of (monotonely) faithful
observations $X$ is represented (monotonely) faithfully by a DAG $G$ if and only if

(1) two observations $a$ and $b$ are adjacent in $G$ if and only if they cannot be made independent by
conditioning on any join of observations in $X \setminus \{a, b\}$.

(2) for three observations $a, b, c$, such that $a$ is adjacent to $b$, $b$ is adjacent to $c$ and $a$ is not adjacent
to $c$, it holds that $a \to b \leftarrow c$ in $G$ if and only if there exists a set $U \subseteq X \setminus \{a, b, c\}$ such that $a$
is independent of $c$ given the join of the observations in $U$.

The proof is given in the appendix. The theorem implies in particular that every monotonely faithful
representation of faithful objects is already a faithful representation.

The PC algorithm [14], [1] for causal inference takes a set of conditional independences on faithful
objects and returns the equivalence class of faithful representations. Since the above theorem is used
to prove the correctness of the algorithm in the faithful case, we conclude that the algorithm correctly
returns monotonely faithful representations given monotonely faithful observations. We apply the
PC algorithm with respect to compression-based information functions in the following section. Although
they are not monotone in a strict theoretical sense, empirical observations indicate that it is unlikely
for the mutual information to increase under conditioning.
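
Conditions (1) and (2) of Theorem 3 can be turned into a small search procedure. The following Python sketch
is not the PC algorithm itself, which prunes the candidate conditioning sets using the current adjacencies; it
is a brute-force search over all conditioning sets, intended for a handful of nodes only. The function names
`find_separator` and `skeleton_and_colliders`, the independence oracle `indep`, and the three toy word sets
(evaluated with the vocabulary measure of Section 5.4) are assumptions made for this example.

```python
from itertools import combinations

def find_separator(a, b, candidates, indep):
    """Smallest U drawn from `candidates` with a independent of b given U, else None."""
    for r in range(len(candidates) + 1):
        for U in combinations(candidates, r):
            if indep(a, b, frozenset(U)):
                return frozenset(U)
    return None

def skeleton_and_colliders(nodes, indep):
    """Brute-force evaluation of conditions (1) and (2) of Theorem 3."""
    edges = set()
    for a, b in combinations(nodes, 2):
        others = [x for x in nodes if x not in (a, b)]
        if find_separator(a, b, others, indep) is None:          # condition (1)
            edges.add(frozenset((a, b)))
    colliders = set()
    for b in nodes:
        neighbours = [x for x in nodes if frozenset((x, b)) in edges]
        for a, c in combinations(neighbours, 2):
            if frozenset((a, c)) in edges:
                continue
            others = [x for x in nodes if x not in (a, b, c)]
            if find_separator(a, c, others, indep) is not None:  # condition (2)
                colliders.add((a, b, c))                          # a -> b <- c
    return edges, colliders

# Toy oracle (assumption): vocabulary measure on three hypothetical word sets.
WORDS = {"a": {"causal", "graph"},
         "c": {"entropy", "string"},
         "b": {"causal", "entropy", "markov"}}

def indep(x, y, U):
    """x is independent of y given U iff every shared word already occurs in U."""
    seen = set().union(*(WORDS[u] for u in U))
    return not ((WORDS[x] & WORDS[y]) - seen)

edges, colliders = skeleton_and_colliders(["a", "b", "c"], indep)
print(sorted(sorted(e) for e in edges), colliders)
```

In this toy example the texts $a$ and $c$ share no words, while $b$ shares one word with each of them, so the
search recovers the edges $a - b$ and $b - c$ and orients them as $a \to b \leftarrow c$.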

7 Compression based information 

In this section we demonstrate that our framework enables us to do causal inference on single objects 
(coded as binary strings) without relying on the uncomputable measure of Kolmogorov complexity. 
To this end, instead of defining complexity with respect to a universal Turing machine we explicitly 
limit ourselves to specific production processes of strings. The underlying measure of information is 
motivated by universal compression algorithms like LZ77 [15] and grammar based compression [16]
that detect repeated occurrences of identical substrings within a given input string and encode them
more efficiently. The choice of a compression scheme can be seen as a prior, analogously to the choice
of a universal Turing machine in the case of algorithmic information. The measures considered in this
section quantify the information of an observation (string) in terms of the diversity of its substrings
and entail the following assumption on causal processes: A mechanism that produces a string y from a
string x is considered as simple if it constructs y by concatenating a small number of substrings from
x (see Lemma 7 below for a formal statement). Further, the amount of dependence of observations is
approximately given by the number of substrings that they share.

We are going to describe two specific measures of information that are closely related to the 
total length of the compressed string, but have better formal properties than the latter. This way our 
conclusions will be independent of the actual implementation of the compression scheme and proving 
theoretical results gets easier. 

In the last part of this section we describe experiments on real data in which the PC algorithm is 
applied to infer the causal structure using either of the two introduced measures of information. 
Note that distance metrics based on compression length have already been used to cluster various 
kinds of data (see [17] for computable distance metrics motivated by algorithmic mutual information
or [18] for an application to molecular biology). These metrics can be used to reconstruct trees
(hierarchical clustering) but if two nodes are linked by more than one path a measure of conditional 
mutual information is needed to reconstruct the data-generation process. To the best of our knowledge, 
compression based methods have not been used before to infer non-tree-like DAGs. 



7.1 Lempel-Ziv information (LZ-information) 

LZ-information has been introduced as a complexity measure for strings in [19]. It has been applied
to quantify the complexity of time series in biomedical signal analysis [20] and distance measures
based on versions of LZ-information have been used to analyze neural spike train data [21] and to
reconstruct phylogenetic trees [22]. We start by defining

Definition 7 (production and reproduction from prefix) Let s = xy be a string. We say s is reproducible
from its prefix x and write $x \to s$ if y is a substring of $x\bar{y}$, where $\bar{y}$ is equal to y without its
last symbol. We say s is producible from x and write $x \Rightarrow s$ if $x \to \bar{s}$, where $\bar{s}$ is equal to s
without its last symbol.

Contrary to reproducibility, producibility allows for the generation of new substrings, for if $x \Rightarrow s$,
the last symbol of s can be arbitrary.



Example: Let s = 0010100 = xy and let $\bar{s}$ be the string s without its last symbol.
[Figure: the string s with its prefix x and the remainder y marked; the last symbol is labeled "new".]
The figure shows that $\bar{s}$ is reproducible from its prefix x by copying the second symbol of x
to the first of y and so on. The string s itself is not reproducible from x, but it is producible.
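The two notions of Definition 7 can be checked mechanically. The sketch below is a minimal Python illustration. The function names are hypothetical, and the split of s = 0010100 into the prefix x = 001 and the remainder y = 0100 is an assumption chosen to be consistent with the description of the example, since the figure itself is not reproduced here.

```python
def is_reproducible(x, s):
    """True if s = x + y and y is a substring of s without its last symbol."""
    if not s.startswith(x):
        return False
    y = s[len(x):]
    return y in s[:-1]

def is_producible(x, s):
    """True if s without its last symbol is reproducible from x."""
    return is_reproducible(x, s[:-1])

s = "0010100"
x = "001"  # assumed prefix; the original figure is not available

print(is_reproducible(x, s[:-1]))  # s without its last symbol
print(is_reproducible(x, s))
print(is_producible(x, s))
```

With this choice of x, the shortened string is reproducible (the copy starts at the second symbol of x), while s itself is producible but not reproducible, because its last symbol breaks the copy.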

Informally, LZ-information counts the minimal number of times during the process of parsing the in- 
put string from left to right, in which the string can not be reproduced from its prefix and a production 
step is needed. 

Definition 8 (LZ-information, [19]) Let s be a string of length n. Denote by $s_i$ the i-th symbol of
s and by $s(i, j)$ the substring $s_i s_{i+1} \cdots s_j$. A production history $H_s$ of s is a partition of s into
substrings $s = s(h_0, h_1)\, s(h_1 + 1, h_2) \cdots s(h_k + 1, h_{k+1})$ with $h_0 = 1$ and $h_{k+1} = n$, such that

\[ s(1, h_i) \Rightarrow s(1, h_{i+1}) \quad \text{for all } i \in \{1, \dots, k\}. \]

A history $H_s$ is called exhaustive if additionally

\[ s(1, h_i) \not\to s(1, h_{i+1}) \quad \text{for all } i \in \{1, \dots, k-1\}. \]

The substrings $s(h_i + 1, h_{i+1})$, $(0 \le i \le k)$ will be called components of $H_s$ and the length $|H_s|$ of
$H_s$ is defined as the number of its components.

The LZ-information of s, denoted by c(s), is defined as the length of its (unique) exhaustive history. 

In an exhaustive history, each $h_i$ is chosen maximal such that $s(1, h_i - 1)$ is reproducible from its
prefix $s(1, h_{i-1})$. As an example, for s = 000100101100110 the exhaustive history partitions s into

s = (0)(001)(00101)(10011)(0),

hence c(s) = 5.
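The exhaustive history can be computed by a greedy left-to-right parse: extend the current component as long as it stays reproducible from the prefix, then append one more symbol. The following Python sketch is a direct, unoptimized reading of Definition 8; the function names are hypothetical and the substring search is quadratic or worse, so it is meant for short strings only.

```python
def exhaustive_history(s):
    """Return the components of the exhaustive history of s as a list."""
    components = []
    i, n = 0, len(s)
    while i < n:
        length = 0
        # Grow the copy while s[i:i+length+1] occurs in the text before its
        # own last symbol, i.e. while it is reproducible from the prefix.
        while i + length < n and s[i:i + length + 1] in s[:i + length]:
            length += 1
        # One additional symbol is produced (unless the string ends first).
        end = min(i + length + 1, n)
        components.append(s[i:end])
        i = end
    return components

def lz_information(s):
    """c(s): the number of components of the exhaustive history."""
    return len(exhaustive_history(s))

print(exhaustive_history("000100101100110"))
# ['0', '001', '00101', '10011', '0']
print(lz_information("000100101100110"))
# 5
```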

In the original paper of Ziv and Lempel [19] it was shown that c is subadditive: for two strings x and y
the information of the concatenated string xy is at most the information of x plus the information of y.
This already suggests the non-negative unconditional dependency measure $I(x : y) = c(x) + c(y) - c(xy)$.
As it turns out, submodularity holds up to a negligible constant independent of the involved
string lengths:

Lemma 6 (submodularity of LZ-information, asymmetric version) Let x, y, z be finite strings over
some alphabet A. Further let $\alpha$ and $\beta$ be symbols not contained in A that will be used as separators.
Then

\[ I(x : y \mid z) := c(z\alpha x) + c(z\alpha y) - c(z\alpha x\beta y) - c(z) \ge -1. \tag{8} \]

Proof: Let $E_{z\alpha}$ be the exhaustive history of $z\alpha$. The exhaustive history of $z\alpha x$ is of the form
$E_{z\alpha x} = [E_{z\alpha}, E_{x|z}]$, where $E_{x|z}$ describes the partition of x induced by $E_{z\alpha x}$. This is because $\alpha$ is
not part of the alphabet, hence the component in $E_{z\alpha x}$ containing $\alpha$ must be of the form $(t\alpha)$ for some
substring t. Analogously $E_{z\alpha y} = [E_{z\alpha}, E_{y|z}]$. It is not difficult to see that

\[ [E_{z\alpha}, E_{x|z}, \beta, E_{y|z}] \]

is a production history of $z\alpha x\beta y$. Theorem 1 in [19] states that a production history is at least as long
as the exhaustive history, hence

\[ |[E_{z\alpha}, E_{x|z}, \beta, E_{y|z}]| \ge |E_{z\alpha x\beta y}| = c(z\alpha x\beta y). \]

Further, $c(z) \le |E_{z\alpha}|$ and so (8) can be bounded from below by

\[ c(z\alpha x) + c(z\alpha y) - c(z\alpha x\beta y) - c(z) \ge |[E_{z\alpha}, E_{x|z}]| + |[E_{z\alpha}, E_{y|z}]| - |[E_{z\alpha}, E_{x|z}, \beta, E_{y|z}]| - |E_{z\alpha}| = -1. \]

□
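The bound of Lemma 6 can also be probed numerically. The Python sketch below evaluates the left-hand side of (8) on random binary strings. The separator characters `#` and `$`, the string length 40 and the random seed are arbitrary choices for illustration, and the helper `c` is a plain greedy implementation of the exhaustive history rather than an optimized one.

```python
import random

def c(s):
    """LZ-information: number of components of the exhaustive history."""
    count, i, n = 0, 0, len(s)
    while i < n:
        length = 0
        while i + length < n and s[i:i + length + 1] in s[:i + length]:
            length += 1
        i += length + 1  # the copied part plus one produced symbol
        count += 1
    return count

def conditional_dependence(x, y, z, alpha="#", beta="$"):
    """Left-hand side of (8); alpha and beta must not occur in x, y or z."""
    return c(z + alpha + x) + c(z + alpha + y) - c(z + alpha + x + beta + y) - c(z)

rng = random.Random(0)

def random_bits(n):
    return "".join(rng.choice("01") for _ in range(n))

values = [
    conditional_dependence(random_bits(40), random_bits(40), random_bits(40))
    for _ in range(200)
]
# Lemma 6 predicts that no value falls below -1.
print(min(values), max(values))
```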

Lemma 7 (functional model for LZ-information, asymmetric version) Let $pa_i$ and $n_i$ be two strings
over an alphabet A and construct a third string $x_i$ by concatenating k substrings of $pa_i$ and $n_i$.
Then

\[ c(pa_i\, \alpha\, n_i\, \beta\, x_i) \le c(pa_i\, \alpha\, n_i\, \beta) + k, \]

where $\alpha$ and $\beta$ are symbols not in A used as separators.






Proof: A production history of $pa_i \alpha n_i \beta x_i$ can be generated by concatenating the exhaustive history
of $pa_i \alpha n_i \beta$ with the list of the at most k substrings out of which $x_i$ is constructed. The length of this
history is $c(pa_i \alpha n_i) + k + 1$ and bounds $c(pa_i \alpha n_i \beta x_i)$ from above by Theorem 1 in [19]. □

In particular, if xy is producible from x by appending y, the information increases by at most
one. Hence, if we restrict the mechanisms that generate a node to consist of a limited number
(compared to the amounts of information involved) of concatenations of substrings from its parents
and the independent noise, the causal Markov condition would follow if c were an information function.
This is not the case, since c is not defined on sets of strings (in particular, it is not symmetric:
$c(xy) \neq c(yx)$). We therefore define the LZ-information of a set of strings to be the LZ-information of
their concatenation with respect to a given order (e.g. lexicographic).

Definition 9 (LZ-information, set version) Let $\{x_1, \dots, x_k\}$ be a set of strings over some alphabet
A. Choose k distinct symbols $\alpha_1, \dots, \alpha_k$ not contained in A that will be used as separators.
Let $X = \{x_{i_1}, \dots, x_{i_m}\}$ be a subset and assume $x_{i_1} < x_{i_2} < \dots < x_{i_m}$ with respect to a given order
on the set of strings over A. We define the LZ-information of X as

\[ LZ(X) = c(x_{i_1} \alpha_{i_1} \cdots x_{i_m} \alpha_{i_m}), \]

where the argument of c is understood as the concatenation of the strings.
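A minimal Python sketch of Definition 9 and of the conditional dependence it induces is given below. Several details are assumptions made for illustration only: strings are stored in a dictionary under hypothetical names, separators are taken from the Unicode private-use area (assumed absent from the alphabet), ties in the lexicographic order are broken by name, and the dependence is computed as LZ(A ∪ C) + LZ(B ∪ C) − LZ(A ∪ B ∪ C) − LZ(C), the usual conditional mutual information built from an information function.

```python
def c(s):
    """LZ-information: number of components of the exhaustive history."""
    count, i, n = 0, 0, len(s)
    while i < n:
        length = 0
        while i + length < n and s[i:i + length + 1] in s[:i + length]:
            length += 1
        i += length + 1
        count += 1
    return count

def lz_set_information(strings, subset):
    """LZ(X) for the named strings in `subset` (Definition 9)."""
    # One separator per string, drawn from the private-use area.
    separators = {name: chr(0xE000 + k) for k, name in enumerate(sorted(strings))}
    # Lexicographic order on the strings; names only break ties.
    ordered = sorted(subset, key=lambda name: (strings[name], name))
    return c("".join(strings[name] + separators[name] for name in ordered))

def lz_dependence(strings, a, b, given=()):
    """I(a : b | given) derived from the set version of LZ-information."""
    a, b, given = set(a), set(b), set(given)
    return (lz_set_information(strings, a | given)
            + lz_set_information(strings, b | given)
            - lz_set_information(strings, a | b | given)
            - lz_set_information(strings, given))

strings = {"u": "000100101100110", "v": "000100101100110"}
print(lz_set_information(strings, {"u"}))    # 5
print(lz_dependence(strings, {"u"}, {"v"}))  # 4
```

In the usage example the second string is an exact copy of the first, so the joint parse needs only one extra component for the copy and the dependence is almost as large as the information of a single string.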

Because of the asymmetry of c, LZ is not monotone and submodular in a strict sense. However, empirical
observations suggest that for sufficiently large strings the violations of submodularity induced
by the asymmetries like $c(x\alpha y) \neq c(y\alpha x)$ are negligible compared to the amounts of information.

Hypothesis: For practical purposes $LZ(\cdot)$ is an information measure up to constants at most logarithmic
in the string length. The associated independence measure I is monotonely decreasing
(through conditioning).

We close by mentioning that the calculation of the LZ-information is very inefficient for large
strings since one has to search over all substrings of the part of the string already parsed. In our
implementation we therefore considered only substrings of length limited by a constant (we chose 30
for strings of English text, since it is unlikely that a substring of length 30 is repeated exactly).

7.2 Grammar based information 

In the grammar based approach to compression an input string x is transformed into a context-free 
grammar that generates x. This grammar is then compressed for example using arithmetic codes. 
We discuss this approach because it has been successfully applied to compress RNA data (e.g. [23]).
Further, the LZ-based compression discussed in the previous section can be rephrased in this framework.
As there are many grammars that produce a given string, it is essential that the transformation
of strings to grammars produces economical representations of x (for an overview see [24]). We
implemented the so-called greedy grammar transform from Yang and Kieffer [16]. It constructs the
grammar iteratively by parsing the input string x. Due to space restrictions we just give an example
of a string and its generated grammar.

Example: The binary string x = 1001110001000 is transformed using the greedy grammar transform
of [16] to the grammar G(x):

\[ s_0 \to s_1\, 1\, 1\, s_2\, s_2, \qquad s_1 \to 1\, 0\, 0, \qquad s_2 \to s_1\, 0, \]

where $s_0$, $s_1$ and $s_2$ are variables of the grammar and x can be reconstructed by starting from $s_0$ and
then iteratively substituting $s_i$ by the right hand side of each production rule above. The length of a
grammar $|G(x)|$ is defined as the sum of all symbols on the right of every production rule, so for the
above example $|G(x)| = 10$. We view the length of the constructed grammar as information measure
of the string that it produces and define, analogously to the LZ-information,
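The example grammar can be checked with a few lines of Python. The sketch below only expands a given grammar and counts its right-hand-side symbols; it does not implement the greedy grammar transform itself, and the dictionary representation of the rules is a choice made for illustration.

```python
def expand(symbol, rules):
    """Recursively replace variables by their right-hand sides."""
    if symbol not in rules:
        return symbol  # terminal symbol
    return "".join(expand(part, rules) for part in rules[symbol])

def grammar_length(rules):
    """|G|: total number of symbols on the right of all production rules."""
    return sum(len(rhs) for rhs in rules.values())

# The grammar G(x) from the example above.
rules = {
    "s0": ["s1", "1", "1", "s2", "s2"],
    "s1": ["1", "0", "0"],
    "s2": ["s1", "0"],
}

print(expand("s0", rules))    # 1001110001000
print(grammar_length(rules))  # 10
```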

Definition 10 (grammar based information) Let $\{x_1, \dots, x_k\}$ be a set of strings over some alphabet
A. Choose k distinct symbols $\alpha_1, \dots, \alpha_k$ not contained in A that will be used as separators.
Let $X = \{x_{i_1}, \dots, x_{i_m}\}$ be a subset and assume $x_{i_1} < x_{i_2} < \dots < x_{i_m}$ with respect to a given order
on the set of strings over A. We define the grammar based information of X as

\[ GR(X) = |G(x_{i_1} \alpha_{i_1} \cdots x_{i_m} \alpha_{i_m})|, \]

where the input of the grammar construction G is understood as the concatenation of the strings.

By definition GR is non-negative, and due to the construction process of the grammar it is monotone.
Experiments show that submodularity is violated, but the amount of violation still allows causal
conclusions to be drawn for sufficiently large strings.

7.3 Experiments 

This section reports the results on causal inference using the introduced LZ-information and grammar 
based information measures. 

Experiment 1: Markov chains of English texts

We start with a string of English text $s_0$ from which we construct further strings $s_1, \dots, s_k$ as follows:
To generate $s_{i+1}$ we translate $s_i$ using an automatic translator from Google to a randomly chosen
European language. Then $s_{i+1}$ is defined as the string that we obtain when we translate the result back to
English using the same translator. Since $s_{i+1}$ is determined by $s_i$, the process can be modeled by
a 'Markov' chain $s_0 \to \cdots \to s_k$. We then apply the PC algorithm to infer the corresponding
equivalence class of (monotonely) faithful causal models consisting of the DAGs:

\[ s_0 \leftarrow \cdots \leftarrow s_i \to \cdots \to s_k \quad \text{for } 0 \le i \le k. \]

In our experiments we chose several starting texts of 1000 to 5000 symbols (e.g. news articles and 
the abstract of this paper) and generated three strings (k = 3) using the described procedure. In every
string we transformed all non-space characters to numbers 0, . . . , 8 using a modulo operation on the 
ASCII value to reduce the alphabet size. Repeated spaces were deleted and the space character has 
been encoded separately by the number 9 to ensure that words of the string remain separated. 
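The preprocessing step can be sketched in Python as follows. The text only states that a modulo operation on the ASCII value maps non-space characters to 0, ..., 8, so taking the character code modulo 9 is an assumption, as are the function name and the treatment of all whitespace (including line breaks) as spaces.

```python
import re

def encode_text(text):
    """Map a text to a string over the digits 0-9.

    Runs of whitespace collapse to a single space, which is encoded as 9;
    every other character is encoded as its code point modulo 9 (assumed).
    """
    collapsed = re.sub(r"\s+", " ", text.strip())
    return "".join("9" if ch == " " else str(ord(ch) % 9) for ch in collapsed)

print(encode_text("a  b"))  # 798
```

The reduction to ten symbols makes exact repetitions of substrings more frequent, while the dedicated symbol for spaces keeps word boundaries visible to the compression based measures.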

The translator is accessible at http://translate.google.de/.

Our implementation of the PC algorithm for causal inference was based on the BNT-Toolbox for Matlab written by
Kevin Murphy and available at http://code.google.com/p/bnt/.

Results: Based on the two information measures, the PC algorithm returned the correct class of
DAGs in every case. For LZ-information the chosen threshold used to determine independence did
not even have to depend on the starting texts $s_0$. Grammar based information seems to be more
sensitive to the string lengths involved and we had to choose a different threshold for every chosen
text $s_0$. Further, we successfully tried the method on the chain of preliminary versions of the abstract
of this paper.

Finally note that methods based on compression distance could also be applied to recover the 
correct equivalence class. The crucial difference to our approach consists in the fact that we did not 
have to assume that the underlying graph is a tree. 

Experiment 2: Four-node networks 

We want to infer the equivalence classes of (monotonely) faithful causal models depicted in Figures
(a) and (b) below. To this end we randomly choose segments of a large English text and then construct
the strings corresponding to the nodes a, b, c and d in a way that ensures the resulting observation
{a, b, c, d} to be (monotonely) faithful. Explicitly, we choose segments $S_x$ and $S_{xy}$ for each node
x and for each edge between nodes x and y respectively. Further, for every ordered triple of nodes
(x, y, z) whose subgraph is not equal to $x \to y \leftarrow z$, we pick a segment $S_{xyz}$. This way we obtain the
following segments with respect to the graph in Figure (a):

\[ S_a, S_b, S_c, S_d, S_{ab}, S_{ac}, S_{bd}, S_{cd}, S_{bac}, S_{abd}, S_{acd} \]

and with respect to the graph in Figure (b) we get segments

\[ S_a, S_b, S_c, S_d, S_{ac}, S_{ad}, S_{bc}, S_{bd}, S_{cad}, S_{cbd}. \]

Finally, the string at a node is constructed as the concatenation of all segments that contain the name
of the node in their index (the order is arbitrary), e.g. in the case of Figure (a)

\[ b = S_b S_{ab} S_{bd} S_{bac} S_{abd}. \]
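The bookkeeping of segment indices can be sketched in Python. The edge orientations used for Figure (a) below are an assumption inferred from the listed segments (the only triple without a segment is b, d, c, so d is taken to be the single collider); the function names are hypothetical, and the sketch only handles triples whose end nodes are not adjacent, which covers both four-node graphs.

```python
from itertools import combinations

def segment_labels(nodes, directed_edges):
    """Index labels of all segments: nodes, edges and non-collider triples."""
    parents = {v: set() for v in nodes}
    adjacent = {v: set() for v in nodes}
    for u, v in directed_edges:
        parents[v].add(u)
        adjacent[u].add(v)
        adjacent[v].add(u)
    labels = list(nodes)
    labels += ["".join(sorted(edge)) for edge in directed_edges]
    for y in nodes:
        for x, z in combinations(sorted(adjacent[y]), 2):
            if z in adjacent[x]:
                continue  # triangles are not treated in this sketch
            if x in parents[y] and z in parents[y]:
                continue  # x -> y <- z receives no common segment
            labels.append(x + y + z)
    return labels

def node_segments(node, labels):
    """Labels of the segments concatenated to form the string at `node`."""
    return [label for label in labels if node in label]

# Assumed orientation for Figure (a): a -> b, a -> c, b -> d, c -> d.
dag_a = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
labels_a = segment_labels("abcd", dag_a)
print(labels_a)
print(node_segments("b", labels_a))  # ['b', 'ab', 'bd', 'bac', 'abd']
```

Given a dictionary that maps each label to a randomly chosen text segment, the string at a node is then the concatenation of the segments returned by `node_segments`.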

As text source we chose an English version of Anna Karenina by Lev Tolstoi. We then transformed
all non-space characters to numbers from 0, . . . , 8 using a modulo operation on the ASCII
value to reduce the size of the alphabet. Repeated spaces were deleted and the space character has
been encoded separately by the number 9 to ensure that words of the string remain separated. The
resulting string consisted of a total of approximately two million symbols. Using the above construction,
we generated 100 observations {a, b, c, d} with respect to each graph and applied the PC algorithm.
The length N of the randomly chosen segments was chosen uniformly between 100 and 200 in the
first run and between 300 and 500 in the second run. The choice of the threshold to determine
independence depended only on the information measure and on the two possible ranges of N, but not on
the individual observations. Further, the graph of Figure (b) implies an unconditional independence
of a and b. Since two disjoint segments of English text cannot be expected to be independent, we
conditioned all informations that we calculate on background knowledge in terms of a fixed segment of
length 5000.

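The text is available at http://www.gutenberg.org/etext/1399

[Figures (a) and (b): the two four-node graphs, each accompanied by a table "Correct answers of PC" with
rows LZ and GR for the segment lengths N ∈ [100, 200] and N ∈ [300, 500]. In the first table the entries
for N ∈ [100, 200] are LZ: 98% and GR: 53%, and a single further entry of 100% survives for N ∈ [300, 500].
Of the second table only the entries 95%, 97% and 100% survive; their assignment to rows and columns
cannot be recovered.]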

Results: Above, the percentages of correct results from the PC-algorithm are shown. Note that using 
LZ-information we were able to recover the correct equivalence class in almost all runs independently 
of the graph structure and segment length. Grammar based inference did not perform quite as well, 
but in the majority of cases in which it did not return the correct Markov equivalence class most of 
the independences still were detected correctly. 



8 Conclusions 

We have introduced conditional dependence measures that originate from submodular measures of 
information. We argued that these notions of conditional dependence (generalizing statistical depen- 
dence) can be used to infer the causal structure among observations even if the latter are not generated 
by i.i.d. sampling. To this end, we formulated a generalized causal Markov condition (with signif- 
icant formal analogies to the statistical one) and proved that the condition is justified provided that 
the attention is restricted to a class of causal mechanisms that depends on the underlying measure of 
information. We demonstrated that existing compression schemes like Lempel-Ziv define interesting 
notions of information and described the class of mechanisms that justify the causal Markov condition 
in this case. Accordingly, we showed that the PC algorithm successfully infers causal relations among
texts when based on the notion of dependence induced by compression schemes.

A Proof of Theorem 1



(1) ⇒ (2). Let $A \subseteq \{x_1, \dots, x_k\}$ be an ancestral set with respect to G and denote by $G_A$ the subgraph
of G with nodes from A. Then the nodes of A fulfill the causal Markov condition with respect to $G_A$.
Denote by $ND_j$ the set of non-descendants of $x_j$ in G. Then we have for all $x_j \in A$

\[ 0 = I(nd_j : x_j \mid pa_j) \ge I(nd_j^A : x_j \mid pa_j) = I(nd_j^A : x_j \mid pa_j^A), \]

where $pa_j^A$ and $nd_j^A$ denote the join of the parents and non-descendants of $x_j$ in $G_A$. The first equality
follows from the causal Markov condition with respect to G, the last one uses $pa_j = pa_j^A$ (because A
is ancestral).

The remaining proof is by induction on the size $|A|$ of the ancestral set. For $|A| = 1$ it is obvious.
Assume the statement is true for sets with $|A| = k - 1$ nodes and let G be a graph with k nodes on
which R fulfills the Markov condition. Without loss of generality assume that $x_k$ has no descendants.
Hence by assumption of the induction

\[ R(x_1, \dots, x_{k-1}) = \sum_{i=1}^{k-1} R(x_i \mid pa_i). \]

By definition of conditioning we get

\[ R(x_1, \dots, x_k) = R(x_1, \dots, x_{k-1}) + R(x_k \mid x_1, \dots, x_{k-1}). \tag{9} \]

Since $x_k$ was chosen to have no descendants it follows that $nd_k \vee pa_k = x_1 \vee \dots \vee x_{k-1}$. Therefore

\[ R(x_k \mid x_1, \dots, x_{k-1}) = R(x_k \mid nd_k \vee pa_k) = R(x_k \mid pa_k), \]

where the last equality follows from the independence of $x_k$ from its non-descendants given its parents,
which is implied by the causal Markov condition with respect to $G_A$. Using this relation in
equation (9) proves (2).

(2) ⇒ (1). We prove the causal Markov condition for every node $x_j$. Using the definition of conditional
mutual information we get

\[ I(nd_j : x_j \mid pa_j) = R(nd_j, pa_j) + R(x_j \mid pa_j) - R(nd_j, pa_j, x_j). \]

Denote by $ND_j$ and $PA_j$ the sets of non-descendant nodes and parent nodes of $x_j$, respectively. Using
decomposability of R with respect to the two ancestral sets $ND_j \cup PA_j$ and $ND_j \cup PA_j \cup \{x_j\}$, we
conclude

\[ I(nd_j : x_j \mid pa_j) = \sum_{x_a \in ND_j \cup PA_j} R(x_a \mid pa_a) + R(x_j \mid pa_j) - \sum_{x_a \in ND_j \cup PA_j \cup \{x_j\}} R(x_a \mid pa_a) = 0. \]

(3) ⇒ (1) holds because the non-descendants of a node are d-separated from the node itself by the
parents.

(1) ⇒ (3). Since the dependence measure I satisfies the graphoid axioms (Lemma 4) we can apply
Theorem 2 in Verma & Pearl [25] which asserts that the DAG is an I-map, or in other words that
d-separation relations represent a subset of the (conditional) independences that hold for the given
objects. □

B Proof of Theorem 3



We only state the proof for the monotonely faithful case, since the proof in the faithful case is similar
and has already been given in [26]. The basic ingredient is Lemma A5 in [26], stating that

(*) any two nodes a and b that are not adjacent in a DAG can be d-separated by a set of nodes $S_{ab}$
not containing a and b.

Let the observations X be monotonely faithfully represented by a graph $G_f$. We show first that $G_f$ fulfills (1) and
(2): Since d-separation implies conditional independence in any causal model (Theorem 1), (*) proves
one direction of (1). The converse direction follows because if a and b could be made independent by
conditioning on the join of nodes in a set $S_{ab}$, then there exists a minimal set with this property.
By definition of a monotonely faithful representation, this minimal set d-separates a and b, hence they cannot
be adjacent. To see (2) let a, b, c be nodes with the required adjacencies and assume $a \to b \leftarrow c$
in $G_f$. Since a and c are not adjacent, by (*) there exists a minimal set U that d-separates a and c.
Now b is not a member of U and d-separation implies the desired independence. Conversely, assume
there exists a set U such that a is independent of c given u, where u is the join of the nodes in U. We can choose U to
be minimal with this property, thus monotone faithfulness of $G_f$ implies that U d-separates a and c.
Because $b \notin U$ by assumption, the graphs $a \to b \to c$ and $a \leftarrow b \to c$ cannot be subgraphs of $G_f$.
Now let G be a DAG that fulfills (1) and (2). We need to show that G is a monotonely faithful
representation of the observations in X. By assumption there exists a monotonely faithful DAG $G_f$
that, as we have seen, also fulfills (1) and (2). In particular, (1) implies that G and $G_f$ have the
same adjacencies and (2) implies that they have the same subgraphs of the form $a \to b \leftarrow c$. Then
a graph-theoretical argument (Lemma ^.8 in [26]) states that G and $G_f$ imply the same d-separation
relations, hence by definition G is a monotonely faithful representation, too.